Python basics for summer school-2023, UEF

Linking python to basic data science”

Liang Chen

University of Eastern Finland

2023-08-08

Table of Contents

In this course, we will show you basic syntax of python and some usefull packages regard to data science.

  • What is python
  • Basic Python syntax
    • Variables
    • Basic arithmetic operations
  • Control Flow Statements
    • Conditional statements
    • Loops

  • Function
    • define and call function
  • Usefull modules
    • Numpy
    • Pandas
    • Scipy
    • Geopandas

What is Python?

  • High-level, interpreted programming language.
  • Known for its simplicity and readability.
  • Used for web development, data analysis, artificial intelligence, and more.

What is Python?

Python is a user-frendly language

For exammple, output Hello, World! in C# and Python.

Strong community support and a vast number of libraries (137,000)

namespace ConsoleApp1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello, World!");
        }
    }
}
print("Hello, World!")

There is a joke: Life is short, you need Python.

Basic Python syntax

One biggest difference of syntax between Python and other languages is that:

  • Python uses indentation rather than’{}’ to control functions or loops or conditioanl statement.

🙅‍♂️

if 5 > 2 {
  print("Five is greater than two!")
}

Basic Python syntax

One biggest difference of syntax between Python and other languages is that:

  • Python uses indentation rather than’{}’ to control functions or loops or conditioanl statement.

🙆‍♂️

if 5 > 2:
  print("Five is greater than two!")

Basic Python syntax

One biggest difference of syntax between Python and other languages is that:

  • Python uses indentation rather than ‘{}’ to control functions or loops or conditioanl statement.
if 5 > 2 {
  print("Five is greater than two!")
}


if 5 > 2:
  print("Five is greater than two!")


How about we remove the indentation of above code:


if 5 > 2:
print("Five is greater than two!")
IndentationError: expected an indented block after 'if' statement on line 1 (2406224367.py, line 2)

Basic Python syntax

Variables

In Python, there are several built-in data types for variables. Here is a table listing some of the common variable types in Python:


Variable Type Description Example
int Integer numbers without decimal points x = 1
float Floating-point numbers with decimal points y = 3.1415926
str Strings (sequences of characters) name = “Teemo”
bool Boolean values (True or False) flag = True
list Ordered, mutable collection of elements numbers = [1, 2, 3]
tuple Ordered, immutable collection of elements coordinates = (10, 20)
set Unordered, mutable collection of unique elements unique_numbers = {1, 2, 3}
dict Collection of key-value pairs person = {‘name’: ‘Alice’, ‘age’: 30}
NoneType Represents the absence of a value no_value = None

Basic Python syntax

Variables

  • left is variable name, right is value.
a = 1
b = 3.1
c = "Teemo"

# or you can assign variables just like follow:

a,b,c = 1,3.1,"Teemo"
print(a)
print(b)
print(c)
1
3.1
Teemo
  • Always give a value when you start to create a variable
d
print(d)
NameError: name 'd' is not defined
'11'

Basic Python syntax

Variables

  • It is possible for converting data of one type to another.
    • Implicit Conversion - automatic type conversion
    • Explicit Conversion - manual type conversion

For example:

# Implicit Conversion
a = 1
print(f"a is {type(a)}")

b = 1.0
print(f"b is {type(b)}")

c = a+b
print(f"c is {type(c)}")
a is <class 'int'>
b is <class 'float'>
c is <class 'float'>

Basic Python syntax

Variables

  • It is possible for converting data of one type to another.
    • Implicit Conversion - automatic type conversion
    • Explicit Conversion - manual type conversion

For example:

# Explicit Conversion
a = 1
print(f"a is {type(a)}")

b = "python"
print(f"b is {type(b)}")

c = a+b
print(f"c is {type(c)}")
a is <class 'int'>
b is <class 'str'>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
c = str(a) + b 
print(f"{c} is {type(c)}")
1python is <class 'str'>

Basic Python syntax

Variables —- list

  • Elements can be mixed with different types
list1 = [1, "2", True, 4]
  • Element can be selected by position index (from 0)
list1 = [1, "2", True, 4]
print(list1[1])
2

Basic Python syntax

Variables —- list


Method Description
append(item) Add an element item to the end of the list.
extend(iterable) Extend the list by appending elements from iterable.
insert(index, item) Insert item at the specified index.
remove(item) Remove the first occurrence of item from the list.
pop(index=-1) Remove and return the element at index.
clear() Remove all elements from the list.
index(item, start, end) Return the index of the first occurrence of item.
count(item) Return the number of occurrences of item.
sort(key, reverse) Sort the list in ascending or descending order.
reverse() Reverse the order of elements in the list.
copy() Create a shallow copy of the list.

Basic Python syntax

Variables —- dict

  • Dictionaries are used to store data values in key:value pairs.
    • ordered
    • changeable
    • not allow duplicates
phone = {
  "brand": "Nokia",
  "model": "N95",
  "year": 2006
}

Basic Python syntax

Variables —- dict

To get keys or values of a dict:

  • directly call the key
phone = {
  "brand": "Nokia",
  "model": "N95",
  "year": 2006
}

print(phone['model'])
#or
print(phone.get('model'))
N95
N95
#get values
print(phone.values())
#get keys
print(phone.keys())
# or both
print(phone.items())
dict_values(['Nokia', 'N95', 2006])
dict_keys(['brand', 'model', 'year'])
dict_items([('brand', 'Nokia'), ('model', 'N95'), ('year', 2006)])

Basic Python syntax

Basic arithmetic operations


Operation Operator Example Result
Addition + 5 + 3 8
Subtraction - 7 - 2 5
Multiplication * 4 * 6 24
Division / 10 / 2 5.0
Floor Division // 10 // 3 3
Exponentiation ** 2 ** 3 8
Modulus % 10 % 3 1

Control Flow Statements

Conditional statements


Condition Mathematical Expression Description
Equal a == b True if a is equal to b; otherwise, False.
Not Equal a != b True if a is not equal to b; otherwise, False.
Greater Than a > b True if a is greater than b; otherwise, False.
Less Than a < b True if a is less than b; otherwise, False.
Greater Than or Equal a >= b True if a is greater than or equal to b; otherwise, False.
Less Than or Equal a <= b True if a is less than or equal to b; otherwise, False.



And usually together with if statement

a = 1
b = 2
if b > a:
  print("b is greater than a")
b is greater than a

Control Flow Statements

Conditional statements

If we want to try multi-conditions, elis and else keywords are needed:

a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("b is less than a")
a and b are equal

Control Flow Statements

Loops

  • for loop
  • while loop
  • Both for and while loops can be linked to:
    • break statement
      • stop the loop before it has looped through all the items
    • continue statement
      • stop current iteration, move to next
    • else statement
      • to specifies codes to be executed when loop is ended

Control Flow Statements

for loop

numbers = [1,2,3,4]
for x in numbers:
    if x == 1:
        continue
    if x == 3:
        continue
    print(x)
else:
    print("loos is over")
2
4
loos is over

with break:

numbers = [1,2,3,4]
for x in numbers:
    if x == 2:
      break
    print(x)
1

Control Flow Statements

while loop

  • Endless execute a set of statements as long as a condition is true.
i = 1
while i < 6:
  print(i)
  i += 1
  • can also work with break and continue
# here 5 is omitted
i = 0
while i < 6:
    i = i + 1
    if i == 3:
        continue
    if i == 4:
        break
    print(i)
1
2

Function

Define and call function

  • The keyword def is used to define function
def say_hello():
    print("hello world!!")
  • Use the function name to call
# define a function
def say_hello():
    print("hello world!!")

# call the function
say_hello()
hello world!!
  • pass parameter
def say_hello(name):
    print(name+" hello world!!")
    
say_hello("Teemo")
Teemo hello world!!

Usefull modules

Numpy

Numpy is an important module for scientific calculation

  • In conda
conda install numpy
  • In pip
pip install numpy
  • import numpy module
import numpy as np
  • Help you for saving time

from:

for (i = 0; i < rows; i++) {
  for (j = 0; j < columns; j++) {
    c[i][j] = a[i][j]*b[i][j];
  }
}

to:

c = a * b

Usefull modules

Numpy

  • Creating arrays
import numpy as np

# 1-dimensional array
arr1d = np.array([1, 2, 3, 4, 5])

# 2-dimensional array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
  • Array shape and dimensions
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape)  # (2, 3)
print(arr.ndim)   # 2

Usefull modules

Numpy

import numpy as np

# Create a 1-dimensional NumPy array
arr1d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Slicing: Get elements from index 2 to 5 (exclusive)
sliced_arr1d = arr1d[2:5]
print(sliced_arr1d)  # Output: [2 3 4]

# Slicing: Get all elements from index 4 to the end
sliced_arr1d = arr1d[4:]
print(sliced_arr1d)  # Output: [4 5 6 7 8 9]

# Slicing: Get elements up to index 3 (exclusive)
sliced_arr1d = arr1d[:3]
print(sliced_arr1d)  # Output: [0 1 2]

# Slicing: Get elements from 
# index -5 (5th element from the end) to the end
sliced_arr1d = arr1d[-5:]
print(sliced_arr1d)  # Output: [5 6 7 8 9]
# Create a 2-dimensional NumPy array
arr2d = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# Slicing: Get the first two rows
sliced_arr2d = arr2d[:2, :]
print(sliced_arr2d)
# Output:
# [[0 1 2]
#  [3 4 5]]

# Slicing: Get the last two columns
sliced_arr2d = arr2d[:, -2:]
print(sliced_arr2d)
# Output:
# [[1 2]
#  [4 5]
#  [7 8]]

# Slicing: Get a sub-matrix from the original array
sliced_arr2d = arr2d[1:, 1:]
print(sliced_arr2d)
# Output:
# [[4 5]
#  [7 8]]

Usefull modules

Numpy

  • Basic array operation
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"origin: \n {arr}")

# Reshape the array to (3, 2)
reshaped_arr = arr.reshape((3, 2))
print(f"reshaped: \n  {reshaped_arr}")
# Transpose the array
transposed_arr = arr.T
print(f"transposed: \n {transposed_arr}")
origin: 
 [[1 2 3]
 [4 5 6]]
reshaped: 
  [[1 2]
 [3 4]
 [5 6]]
transposed: 
 [[1 4]
 [2 5]
 [3 6]]

Usefull modeles

Scipy

  • to provide a wide range of scientific and technical computing capabilities in Python.

For example:

  • Interpolation
Code
import numpy as np
import plotly.graph_objects as go
from scipy import interpolate

# Sample data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 6, 3])
# Create an interpolation function
interp_func = interpolate.interp1d(x, y, kind='linear')
interp_func1 = interpolate.interp1d(x, y, kind='cubic')
# Define points where we want to estimate values
x_new = np.linspace(1, 5, num=100)  # Generate 100 points between 1 and 5
# Perform linear & cubic interpolation to estimate 
# y-values for new x-values
y_new = interp_func(x_new)
y_new1 = interp_func1(x_new)
# Plot the original data and the interpolated values
fig = go.Figure()
fig.add_trace(go.Scatter(x = x,y = y,
              mode='markers', name='Origibal data'))
fig.add_trace(go.Scatter(x = x_new,y = y_new,
              mode='lines', name='Linear interpolation'))
fig.add_trace(go.Scatter(x = x_new,y = y_new1,
              mode='lines', name='Cubic interpolation'))
fig.show()

Usefull modeles

Scipy

  • Curve fit
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Sample data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 7.1, 8.8, 12.3])

# Define the function to fit (in this example, we'll use a simple quadratic function)
def quadratic_func(x, a, b, c):
    return a * x**2 + b * x + c

# Perform curve fitting
params, cov_matrix = curve_fit(quadratic_func, x, y)

# Get the fitted parameters
a_fit, b_fit, c_fit = params

# Generate new x values for plotting the fitted curve
new_x = np.linspace(1, 5, 100)

# Calculate the y values for the fitted curve
fitted_y = quadratic_func(new_x, a_fit, b_fit, c_fit)

trace_data = go.Scatter(x=x, y=y, mode='markers', name='Original Data')

# Create the trace for the fitted curve
trace_fit = go.Scatter(x=new_x, y=fitted_y, mode='lines', name='Fitted Curve')

# Create the layout for the plot
layout = go.Layout(title='Curve Fitting with Scipy', xaxis=dict(title='x'), yaxis=dict(title='y'))

# Create the figure
fig = go.Figure(data=[trace_data, trace_fit], layout=layout)

# Show the figure
fig.show()

Usefull modeles

Pandas

  • to provide a powerful library for data manipulation, cleaning, exploration, and analysis

For example:

  • Read data from from multi-sources
Data Source Pandas Function Example
CSV pd.read_csv() pd.read_csv('data.csv')
TXT pd.read_csv() pd.read_csv('data.txt', delimiter='\t')
DAT pd.read_csv() pd.read_csv('data.dat', delimiter=' ')
Excel pd.read_excel() pd.read_excel('data.xlsx')
JSON pd.read_json() pd.read_json('data.json')
SQL Database pd.read_sql() pd.read_sql('SELECT * FROM table_name', connection)
Clipboard pd.read_clipboard() pd.read_clipboard()
URL pd.read_csv() pd.read_csv('https://example.com/data.csv')
HTML pd.read_html() pd.read_html('https://example.com/table.html')

Usefull modeles

Pandas

  • Pandas VS. Build-in IO
import pandas as pd

# Read data.txt using Pandas
df = pd.read_csv('data.txt')
with open('data.txt', 'r') as file:
    # Skip the header line
    next(file)
    # Initialize an empty list to store data
    data = []
    # Read and process each line in the file
    for line in file:
        line = line.strip()  # Remove leading/trailing whitespaces
        col1, col2, col3 = line.split(',')  # Split the line using comma as the delimiter
        data.append({'col1': col1, 'col2': col2, 'col3': col3})

# Create a DataFrame from the data list
df = pd.DataFrame(data)

Usefull modeles

Pandas

  • Basic data exploration
print(df.head())     # View the first few rows
print(df.info())     # Summary of DataFrame information
print(df.describe()) # Summary statistics
print(df.shape)      # Number of rows and columns
  • Filtering
df[df['col1'] > 0] # keep all rows based on col1 > 0
  • Groping and aggregation
import pandas as pd

data = {
        'Name': ['1', '1', '1','0','0','0'],
        'Value': [25, 30, 22,12,32,55]
}

df = pd.DataFrame(data)
grouped_df = df.groupby('Name').mean()
print(grouped_df)
          Value
Name           
0     33.000000
1     25.666667

Usefull modeles

Pandas

  • Handling Missing Data:
 # Drop rows with missing values
f.dropna()
# Fill missing values with a specific value (e.g., 0)
df.fillna(0)
# Forward fill missing values with the previous value
df.fillna(method='ffill')
  • Mergeing data
df1 = pd.DataFrame({'ID': [1, 2, 3], 
                    'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 
                    'City': ['New York', 'Los Angeles', 'Chicago']})

merged_df = pd.merge(df1, df2, on='ID')
merged_df
ID Name City
0 2 Bob New York
1 3 Charlie Los Angeles

Usefull modeles

GeoPandas

to provide a convenient and efficient way to work with geospatial data in Python - Load *.shp file

Code
import geopandas as gpd
import matplotlib.pyplot as plt
# Load a shapefile into a GeoDataFrame
gdf = gpd.read_file('krycklan\\5haStreams.shp').to_crs("epsg:4326")

# Display the first few rows of the GeoDataFrame
print(gdf.head())
   ARCID  GRID_CODE  FROM_NODE  TO_NODE  \
0      1          1          4        1   
1      2          1          4        2   
2      3          1          4        6   
3      4          1          6        3   
4      5          1          5        7   

                                            geometry  
0  LINESTRING (19.79398 64.29149, 19.79379 64.291...  
1  LINESTRING (19.79398 64.29149, 19.79421 64.291...  
2  LINESTRING (19.79398 64.29149, 19.79427 64.291...  
3  LINESTRING (19.79486 64.29056, 19.79496 64.290...  
4  LINESTRING (19.78224 64.29183, 19.78235 64.291...  

Usefull modeles

GeoPandas

  • Plot
Code
import geopandas as gpd

gdf = gpd.read_file('krycklan\\5haStreams.shp').to_crs("epsg:4326")
gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook

Usefull modeles

GeoPandas

Function/Method Description
gpd.read_file() Load geospatial data from file (e.g., shapefile, GeoJSON, GeoPackage).
GeoDataFrame() Create a GeoDataFrame from a DataFrame with geometry column or other geospatial data.
gdf.head() Display the first few rows of the GeoDataFrame.
gdf.plot() Plot the GeoDataFrame using Matplotlib.
gdf.geometry Access the geometry column of the GeoDataFrame.
gdf.crs Get or set the coordinate reference system (CRS) of the GeoDataFrame.
gdf.to_crs() Reproject (transform) the GeoDataFrame to a new CRS.
gdf.bounds Calculate the bounding box of the GeoDataFrame.
gdf.buffer() Create a buffer around the geometries in the GeoDataFrame.
gdf.intersection() Perform spatial intersection between two GeoDataFrames.
gdf.dissolve() Merge geometries based on a common attribute value.
gpd.overlay() Perform spatial overlay operations (intersection, union, difference, etc.) between GeoDataFrames.
gdf.cx[] Spatial filtering based on a bounding box.

Usefull modeles

GeoPandas

Function/Method Description
gdf.intersects() Check if geometries intersect with a specific geometry.
gpd.sjoin() Perform a spatial join between two GeoDataFrames based on their spatial relationship.
gdf.area Calculate the area of geometries in the GeoDataFrame.
gdf.length Calculate the length of geometries (lines) in the GeoDataFrame.
gdf.centroid Get the centroid point of geometries in the GeoDataFrame.
gdf.to_file() Save the GeoDataFrame to a file (e.g., shapefile, GeoJSON, GeoPackage).

Some usefull tutorials

Library Official Documentation
NumPy https://numpy.org/doc/
SciPy https://docs.scipy.org/doc/scipy/reference/
Pandas https://pandas.pydata.org/docs/
GeoPandas https://geopandas.org/en/stable/
Matplotlib https://matplotlib.org/stable/



Thanks 😊


material can be found in my github

liangch@uef.fi